In silico approaches for the identification of potential allergens among hypothetical proteins from Alternaria alternata and its functional annotation

Direct exposure to the fungal species Alternaria alternata is a major risk factor for the development of asthma, allergic rhinitis, and inflammation. As of November 23rd 2020, the NCBI protein database listed 11,227 proteins from the A. alternata genome as hypothetical proteins (HPs). Allergens are the main causative agents of several life-threatening diseases, especially in fungal infections. Therefore, the main aim of the study is to identify the potentially allergy-inducing proteins among the HPs of A. alternata and to assign their functions, for a more complete understanding of the complex biological systems at the molecular level. AlgPred and the Structural Database of Allergenic Proteins (SDAP) were used for the prediction of potential allergens from the HPs of A. alternata. While analyzing the proteome data, 29 potential allergens were predicted by AlgPred, and further screening in SDAP confirmed the allergenicity of 10 proteins. An extensive set of bioinformatics tools covering protein family classification, sequence-function relationship, protein motif discovery, pathway interactions, and intrinsic features of the amino acid sequence was used to predict the probable functions of the 10 HPs. The functions of the HPs are characterized as chitin-binding, ribosomal protein P1, thaumatin, glycosyl hydrolase, and NOB1 proteins. The subcellular localization and signal peptide prediction of these 10 proteins have provided additional information on localization and function. The allergen prediction and functional annotation of the 10 proteins may facilitate a better understanding of the allergenic mechanism of A. alternata in asthma and other diseases. The functional domain level insights and predicted structural features of the allergenic proteins help to understand the pathogenesis and host immune tolerance. The outcomes of the study would aid in the development of specific drugs to combat A. alternata infections.


Sequence retrieval and dataset analysis
The complete HP sequences of A. alternata were retrieved from the NCBI database in FASTA format using their primary accession numbers. The sequences of all 11,227 HPs were subjected to computational prediction for the identification of potential allergens. Furthermore, the identified allergens were functionally annotated using a well-optimized series of bioinformatics tools.
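As an illustration of the retrieval step, the following minimal sketch reads FASTA-formatted text into a dictionary keyed by accession. It is not the pipeline used in the study; the function name `parse_fasta`, the identifiers `HP_0001` and `HP_0002`, and the short sequences are hypothetical and serve only to show the format.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {accession: sequence} mapping."""
    records: Dict[str, str] = {}
    accession = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # The first whitespace-delimited token of the header is the key.
            tokens = line[1:].split()
            accession = tokens[0] if tokens else ""
            records[accession] = ""
        elif accession is not None:
            # Sequence lines may be wrapped; concatenate them in order.
            records[accession] += line.upper()
    return records


# Hypothetical two-record input (made-up sequences, for illustration only).
example = """>HP_0001 hypothetical protein [Alternaria alternata]
MKTAYIAKQR
QISFVKSHFS
>HP_0002 hypothetical protein [Alternaria alternata]
MSTEELAAAY
""".splitlines()

hps = parse_fasta(example)
for acc, seq in hps.items():
    print(acc, len(seq))
```

Keeping the sequences in a plain mapping makes it straightforward to pass each HP to the downstream prediction steps one accession at a time.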

Allergen prediction: AlgPred and SDAP
AlgPred, a web-based allergen prediction tool (http://www.imtech.res.in/raghava/algpred), was used to predict the possible allergenic proteins among the 11,227 HPs of A. alternata [23]. AlgPred also predicts the potential IgE epitopes in the submitted HPs. The tool uses different approaches for the prediction of allergenic proteins, including motif-based techniques, machine learning, and a hybrid approach [24]. A protein predicted to be an allergen by most of the approaches has a high probability of being an allergenic protein. The possible allergenic proteins predicted by AlgPred were then submitted to the Structural Database of Allergenic Proteins (SDAP) (http://fermi.utmb.edu/sdap/) to investigate cross-reactivity with known allergens [25]. In order to detect distantly related sequences, the physical-chemical amino acid descriptors E1-E5 were used to locate sequences with similar chemical properties. With the E1-E5 descriptors, the similarity between two sequences is examined with the property distance function PD. Each amino acid is represented as a vector, and these vectors are generated by metric multidimensional scaling of 237 physical-chemical properties of the 20 naturally occurring amino acids.

www.nature.com/scientificreports/

Physicochemical properties and sub-cellular localization
The ProtParam tool in the Expasy server (http://web.expasy.org/protparam/) and PDB Goodies were employed to compute the physicochemical properties of the HPs [26,27]. Theoretical values of various physicochemical properties such as molecular weight, aliphatic index, isoelectric point, instability index, extinction coefficient, and grand average of hydropathicity (GRAVY) were calculated for the selected 10 HPs (Table 1). The WoLF PSORT [28] and CELLO [29] servers were used to predict the subcellular localization of the potential allergens. WoLF PSORT converts the given sequences into numerical localization features based on sorting signals, amino acid composition, and functional motifs. After conversion, a simple k-nearest neighbor classifier is used to predict the subcellular location of the protein. CELLO uses a two-level SVM (Support Vector Machine) classifier and a homology search method to annotate the subcellular localization of the HPs (Table 2).
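Two of the properties listed above can be reproduced directly from the amino acid sequence. The sketch below assumes the standard definitions: GRAVY as the mean Kyte-Doolittle hydropathy value over all residues, and the aliphatic index as X(Ala) + 2.9 X(Val) + 3.9 [X(Ile) + X(Leu)], where X is the mole percent of each residue. The function names and the test peptide are hypothetical, and this is not a substitute for the ProtParam output reported in Table 1.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _check(seq: str) -> str:
    """Upper-case the sequence and reject empty or non-standard input."""
    seq = seq.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")
    return seq


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy value per residue."""
    seq = _check(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq: str) -> float:
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    seq = _check(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Hypothetical peptide, for illustration only.
peptide = "AVIL"
print(round(gravy(peptide), 3), round(aliphatic_index(peptide), 1))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value a hydrophilic one, while a higher aliphatic index is usually read as a sign of greater thermostability.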

Sequence-based functional annotation
The identified allergens in A. alternata were extensively analyzed using CDD (Conserved Domain Database), InterPro, and Pfam to characterize the functional domains from the sequences of these HPs [30][31][32]. CDD inspects the functional characteristics of protein sequences by using the heuristic BLAST algorithm, and searches against a complete collection of domains to identify the structural and functional domains in the protein sequences [33]. InterProScan combines multiple resources for motif discovery and predicts protein domains, families, and functional sites. Protein sequence motifs are the signatures of protein families and are often used to predict the function of a protein; in metabolic enzymes in particular, these motifs are associated with catalytic functions. Pfam families are defined by multiple sequence alignments and profile hidden Markov models (HMMs) built from family-representative sequences. Pfam uses the HMM algorithm to search the target sequence against the UniProt Knowledgebase (UniProtKB) to predict family relationships [34].

Structure modelling and validation
Protein structural folds are more highly conserved than sequences. Thus, structure-based functional annotation of the HPs is considered more reliable than sequence-based function assignment. The three-dimensional structures of the predicted allergens were modelled using Phyre2 (comparative homology) [35] and Robetta (de novo) [36]. In the absence of structural homologs in the repositories, Robetta builds the three-dimensional structure of the targeted allergenic proteins by the de novo fragment insertion method. The Monte Carlo local structure search algorithm was used for energy minimization and optimization. Both knowledge-based and physically derived scoring terms were used to score the quality of the generated models. The PROCHECK program was used to validate the reliability of the generated structures by analyzing the overall structure and the residue-by-residue geometry of the proteins [37]. ProQ, a neural network method, was also used to predict the quality of the predicted structures [38]. Models showing a high LG score and MaxSub score were selected for function prediction studies.

Structure-based functional prediction
The predicted structures of the HPs were then used as similarity search queries in the ProFunc and DALI servers for structure-based function prediction [39,40]. ProFunc uses secondary structure elements (SSEs), the SURFNET algorithm, residue conservation, and nest analysis on the query structure to identify similar functional motifs or close associations with experimentally annotated proteins. DALI uses a weighted sum of similarities of intramolecular distances to identify proteins in the PDB that are structurally similar to the input structure. The list of structural neighbors is sorted by the pairwise structural similarity score (Z-score). A higher Z-score implies that the structures agree more closely in architectural details.

Results and discussion
Advancements in the field of computational biology have produced several models, namely the Hidden Markov Model (HMM), Neural Network (NN) model, and Support Vector Machine (SVM), to decode biological phenomena at the system level. These models and their associated methods have made the functional annotation of proteins more efficient and accurate. We have used the above-described models and methods to identify the potential allergens among the HPs of A. alternata. Further, the functions of the selected proteins were annotated based on their sequence and structural information. A total of 11,227 HPs of A. alternata were retrieved from the NCBI database and evaluated for their allergenicity using bioinformatics approaches.

Allergenic prediction
The prediction of allergenic proteins through computational approaches is an important step in the development of effective vaccines and therapeutics in the pharmaceutical industry. The FAO/WHO (FAO: Food and Agriculture Organization of the United Nations; WHO: World Health Organization) Codex Alimentarius Commission guidelines (2003) have recommended various tests for examining the allergenic behavior of proteins, which include the origin of the gene, sequence similarity with known allergens, protein stability, and the binding mechanism of IgE epitopes. AlgPred designates a protein as a potential allergen when its sequence shares more than 35% similarity (over 80 amino acids) with known allergens (Table 3). Based on the AlgPred result, 29 HPs were predicted as potential allergens.
The bioinformatics part of the 2001 guidelines states that a protein is potentially allergenic if it shares either at least six contiguous amino acids or a minimum of 35% sequence similarity over a window of 80 amino acids with known allergenic proteins. The 29 protein sequences predicted as allergens by AlgPred were further analyzed with SDAP. SDAP confirmed 10 protein sequences as allergens, and the remaining 19 protein sequences, which did not fulfill the SDAP criteria, were excluded from the study (Table 4). Among the 10 protein sequences, A0A177DEP8 and A0A4Q4N5B7 showed high sequence similarity (96.25%) with the allergen Alt a 12 of A. alternata. Alt a 1 to Alt a 12 are the well-known allergens of A. alternata. Alt a 12 corresponds to the large ribosomal subunit protein P1, which plays a distinct role in protein synthesis [7]. A0A4Q4NGZ8 and A0A4Q4NJR8 showed 50% sequence similarity with allergens from Penicillium crustosum (Pen cr 26.0101) and Cupressus arizonica (Cup a 3), respectively. In general, P. crustosum is a food spoilage microorganism that is also responsible for the production of mycotoxins, and Pen cr 26 is a ribosomal protein P1 [41]. The Cupressaceae family is a relevant cause of respiratory allergy, including rhino-conjunctivitis, hay fever, and asthma in sensitized individuals [42]. Cup a 3, a major allergen in this family, is reactive in more than 90% of Cupressaceae-allergic patients [43]. A0A177D895 and A0A177DB16 showed high sequence similarity (47.50%) with allergens from Triticum aestivum (Tri a 18) and Gallus domesticus (Gal d). Tri a 18 is a minor allergen for patients with bakers' asthma [44]. A0A4Q4NI20 and A0A4Q4N975 showed 43.75% and 41.75% sequence similarity with allergens from Musa acuminata (Mus a 2.0101) and Hevea brasiliensis (Hev b 5), respectively. Mus a 2 from banana is a class 1 chitinase that belongs to the pathogenesis-related proteins (family 3) and provoked a positive skin prick test in 50% of banana-allergic patients [45]. Hev b 5 has been identified as a major latex allergen, and sensitization is particularly observed among healthcare workers. The allergic reaction ranges from rhinitis to asthma, conjunctivitis, urticaria, anaphylactic shock, and occasionally death [46].
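The two sequence criteria described above can be sketched in a few lines. This is a simplified, ungapped illustration and not the alignment procedure used by AlgPred or SDAP, which rely on proper sequence alignment; the function names are hypothetical, percent identity stands in for similarity, and the 35% threshold is applied as "at least 35%".

```python
def shares_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """True if the two sequences share an identical stretch of k residues."""
    kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in kmers for i in range(len(query) - k + 1))


def best_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Highest percent identity over any ungapped window of the given length.

    Every diagonal (relative offset) between the two sequences is scanned
    with a sliding window. Returns 0.0 if no full window fits.
    """
    n, m = len(query), len(allergen)
    best = 0
    for d in range(-(m - window), n - window + 1):
        # Diagonal d pairs query[i] with allergen[i - d].
        i0 = max(0, d)
        j0 = i0 - d
        length = min(n - i0, m - j0)
        if length < window:
            continue
        matches = [query[i0 + t] == allergen[j0 + t] for t in range(length)]
        current = sum(matches[:window])
        best = max(best, current)
        for t in range(window, length):
            # Slide the window by one position along the diagonal.
            current += matches[t] - matches[t - window]
            best = max(best, current)
    return 100.0 * best / window


def is_potential_allergen(query: str, allergen: str) -> bool:
    """Apply the six-residue rule or the 35%-over-80-residues rule."""
    return shares_kmer(query, allergen) or best_window_identity(query, allergen) >= 35.0


# Hypothetical sequences, for illustration only.
print(is_potential_allergen("MKTAYIAKQR", "GGTAYIAKGG"))
```

Because gaps are not allowed, this sketch can underestimate the identity that an alignment-based tool would report, so it is best read as a lower bound on the window score.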

A0A4Q4NJR8
The functional annotations of the potential allergens are listed in Table 5. The sequence-based analysis suggests that the HP (A0A4Q4NJR8) is localized in the extracellular region and may belong to the thaumatin-like protein family. A BLASTP search showed that the HP is related to the thaumatin-like food allergen from Malus domestica that is associated with IgE-mediated symptoms in apple-allergic individuals [47]. The apple protein, whose amino-terminal sequence shares about 50% identity with pathogenesis-related protein-5 family members, was the first thaumatin-like protein described as an allergen [48]. The family and conserved domain databases strongly suggest that the HP is a thaumatin-like protein. The thaumatin-like proteins are also involved in host defense mechanisms and a wide range of developmental processes in fungi, plants, and animals [49]. HHpred also suggests a high similarity with thaumatin I from Thaumatococcus daniellii. A motif search using MotifFinder suggests that the HP sequence possesses a motif that is characteristic of the thaumatin protein family. Usually, large-type thaumatin-like proteins have 16 cysteine residues at conserved positions, and this characteristic feature was also observed in our HP [50]. These residues can form up to 8 disulfide bridges, which are highly conserved in the thaumatin-like proteins. The STRING database indicates that the HP showed the maximum score with Setosphaeria turcica, and the result revealed several interaction partners such as glycoside hydrolase family 2 protein, glycoside hydrolase family 12

The three-dimensional structure of the HP was predicted by Phyre2, which showed a sequence homology and identity of 69% and 38%, respectively, with the template (PDB ID: 2AHN). Model validation with the Ramachandran plot showed 84.9% of amino acid residues in the favored region and 0.5% of amino acid residues in the disallowed region. The LG score of the HP model is −0.835, showing that the predicted model is extremely reliable. Structure comparison and analysis revealed that the HP contains a lectin-like β barrel (Domain I), several loops (Domain II), and two beta sheets (Domain III), and all three domains are stabilized by at least one disulfide bridge formed by cysteine residues with a conserved spatial distribution throughout the protein [49]. Superimposition of the HP model with other thaumatin-like proteins showed RMSD values of 0.660 Å (PDB ID: 3ZS3), 0.869 Å (PDB ID: 1DU5), and 0.000 Å (PDB ID: 2AHN), respectively, showing that the HP adopts a fold similar to that of 2AHN, which indicates a closely related function. The SuSPect tool embedded in Phyre2 identified Cys290 as the residue with the highest mutational sensitivity, indicating a functional/phenotypic effect in the protein. Further, Pocket-Finder analysis shows that the amino acid residues Tyr263, Asp265, Asp266, Ile268, Gln269, Arg270, Pro271, and Asn283 play a major role as active site residues.

The DALI server shows a high structural similarity of the HP with proteins whose function is similar to that of thaumatin-like proteins. We found significant matches with thaumatin-like protein (Z score: 39.0), laminaripentaose-producing beta-1,3-glucanase (Z score: 14.9), beta-1,3-glucanase (Z score: 12.6), etc. The aligned residues are usually in the range of 189-412, with the RMSD in the range of 0.9 Å to 2.9 Å. We also observed a close structural similarity with the beta-1,3-glucanase enzyme. Furthermore, the ProFunc server revealed the close similarity of the HP with the structure of the allergenic and antifungal banana fruit thaumatin-like protein. An extensive sequence and structural analysis strongly suggests that the HP could function as a thaumatin-like protein.

A0A177DEP8, A0A4Q4NGZ8 and A0A4Q4N5B7
The sequence-based analysis (InterPro, CD search, and Pfam) strongly suggests that these HPs are ribosomal protein P1 and that their subfamily represents the eukaryotic large ribosomal subunit protein P1. Also, HHpred analysis showed high similarity with 60S acidic ribosomal protein P1. These HPs were predicted to be localized in the cytoplasm by WoLF PSORT and CELLO. The acidic ribosomal P proteins are small molecules (10-11 kDa) that form lateral stalk structures in the active site region of the large ribosomal subunit and play an important role in the elongation phase of the translation process [51]. Based on sequence homology, the ribosomal P proteins are classified into two types (P1 and P2) in mammals, yeast, and protozoans, whereas a third distinct group (P3) was observed in plants [52]. Furthermore, these HPs contain a structural motif that is found in the family of 60S acidic ribosomal proteins. The functional partnerships of the three HPs were predicted using the STRING database. The HPs A0A177DEP8 and A0A4Q4N5B7 showed the maximum score with Mycosphaerella pini and partnership interactions with zinc-binding ribosomal protein S27e-like protein, 60S acidic ribosomal protein P0, and 40S ribosomal protein S21. Similarly, A0A4Q4NGZ8 showed the maximum score with Parastagonospora nodorum and exhibited functional interactions with the eukaryotic ribosomal protein P1/P2 family, 60S acidic ribosomal protein P0, and the universal ribosomal protein uS4 family. Based on these findings, we suggest that these three HPs may function as 60S acidic ribosomal proteins.
The three-dimensional structures of the HPs (A0A177DEP8, A0A4Q4NGZ8, and A0A4Q4N5B7) were predicted using the Phyre2 server. These HPs showed high sequence similarity with the structure of human ribosomal protein P1/P2 (PDB ID: 2LBF), which was used as the template to predict the models. The predicted models were validated using the Ramachandran plot, which showed that 82.1%, 84.4%, and 86.4% of amino acids, respectively, were present in the favored region, and none of the residues occupied the disallowed region except in the A0A4Q4N5B7 protein (0.7%). The model quality was validated using ProQ, which showed an LG score of −0.835, confirming the structural quality. Likewise, structural superimposition of the predicted models on the template structure showed low RMS deviations of 0.14 Å, 0.14 Å, and 0.15 Å, respectively, confirming the reliability of the predicted models. DALI analysis returned similar structures that belong to 60S acidic ribosomal protein P1. Likewise, ProFunc gave the same result as DALI. Active site prediction shows that Trp43, Leu46, Phe47, Ala50, Leu51, Lys55, Asp58, Leu59, Asn62, and Val63 are the important amino acids in the predicted active sites of A0A177DEP8 and A0A4Q4N5B7. Also, the A0A4Q4NGZ8 active site may contain Met1, Ser2, Glu9, Gln10, Ala13, Trp47, Leu50, Phe51, Ala54, Leu55, Lys58, Glu62, Val63, Leu64, Thr65, Ala66, Val67, Thr68, Ala69, and Ala70. In addition, an earlier study reports that acidic ribosomal protein P1 from A. alternata is considered a major allergen and plays a role in fungal allergy and autoimmune disease. Moreover, A. alternata is categorized as a rich source of mold allergens, which are deposited in the WHO/IUIS database [53]. The present investigation strongly suggests that these three HPs may act as 60S acidic ribosomal protein P1 and can be classified as allergens with virulence properties.

A0A177D895, A0A177DB16 and A0A4Q4NI20
The sequence-based analysis, including InterPro, Pfam, and CD search, revealed that the HPs A0A177D895, A0A177DB16, and A0A4Q4NI20 may act as chitin recognition proteins or ChtBD1_1 domain-containing proteins. Also, HHpred analysis showed maximum similarity with cysteine-rich and chitin-binding proteins. Furthermore, these HPs contain a structural motif that is found in the chitin recognition protein. The sequence-based analysis therefore suggests that these HPs function as chitin-binding proteins. WoLF PSORT and CELLO predict that these HPs are present in the extracellular region and are insoluble. Chitin-binding proteins (CBPs) are involved in various biological processes such as hydrophobic surface sensing, binding to chitin, antimicrobial activity, and increasing chitinolytic activity [54][55][56]. Chitin is commonly found in the exoskeleton of arthropods, nematodes, protozoa, insects, and mollusks, and in fungal cell walls. Based on the chitin-binding property and amino acid similarity, the carbohydrate-binding modules are classified into several families, including 1, 2, 12, 14, 18, 19, and 33 [57]. Chitin-binding proteins mainly contribute to the chitin degradation mechanism, and their action varies from fungi to other organisms. In addition to being present as discrete domains in enzymes, chitin-binding modules also exist as independent, non-catalytic proteins. Such non-catalytic CBPs are mostly found in families 14, 18, and 33 [58].
Due to the unavailability of an appropriate template, the three-dimensional structures of these HPs were predicted using the Robetta server. The quality and accuracy of the structures were validated using the Ramachandran plot, which showed that 87.6%, 88.1%, and 87.7% of residues, respectively, occupied the favored region, and no residues occupied the disallowed region except in A0A177DB16 (0.2%), suggesting good quality of the predicted models. The LG score of −0.835 implies that the predicted models are valid with high confidence. ProFunc and DALI analysis revealed that A0A177D895 may act as a chitin recognition protein and is involved in a variety of biological reactions. In contrast, the other two proteins, A0A177DB16 and A0A4Q4NI20, showed no significant function due to their high structural and sequence variation as compared with A0A177D895. In addition, no similar hits were obtained for them from ProFunc and DALI analysis. Chitin, chitinases, and chitin-binding proteins commonly produce allergic inflammation as well as wound-inducible activity. In addition, chitin-binding proteins are also classified as pathogenesis-related proteins, which include prohevein and other wound-inducible proteins. Prohevein is a cysteine-rich protein and one of the major IgE-binding allergens in natural rubber latex that affect healthcare workers. Earlier studies reported that the hevein protein has significant similarity (about 71%) with chitin-binding proteins, which may explain the reactions seen in latex-allergic patients [59,60]. Hence, the present investigation concludes that these HPs act as chitin-binding proteins and induce allergenic reactions in humans as well as cause asthmatic inflammation.

A0A177DU49 and A0A4Q4NRZ2
The HPs A0A177DU49 and A0A4Q4NRZ2 are predicted to be localized in the nucleus. BLASTP sequence analysis suggested that they act as the 20S pre-rRNA D-site endonuclease Nin One Binding (NOB1). Furthermore, sequence-based functional prediction indicates that the HPs are Nin One Binding (NOB1) proteins, and the virulence prediction indicates that the HPs are involved in cellular processes. The 20S pre-rRNA is converted into the mature 18S rRNA in the cytoplasm by the action of the NOB1 endonuclease at site D [61]. NOB1 contains a PilT N-terminus (PIN) domain, common to many other exonucleases and endonucleases, and a zinc ribbon domain. In general, PIN domain proteins have been shown to possess endonucleolytic activity [62]. The UniProt molecular function annotation suggests that the HPs possess endoribonuclease activity. The STRING database indicates that the HP showed the maximum score with Pyrenophora tritici-repentis, and the result revealed several interaction partners such as bystin, pre-rRNA processing protein pno1, serine/threonine-protein kinase RIO2/RIO3, low-temperature viability protein ltv1, periodic tryptophan protein 2, U3 small nucleolar ribonucleoprotein IMP4, rRNA biogenesis protein RRP5, and GTP-binding protein Bms1. MEME suite analysis suggests the presence of three significant motifs in the sequences of both HPs, namely 68ʹ-CHACFNIDFQMDKQFCKRC, 471ʹ-CNNDSPARYDAYAAFCKKKGAHAVGLMQD, and 515ʹ-HPWEKMGDKY. The active site region of the HPs is predicted to comprise Glu8, Ile10, Gly11, Glu12, Gly13, Thr14, Tyr15, Val18, Lys20, Ala31, Lys33, Val64, Phe80, Glu81, Phe82, Leu83, His84, Gln85, Asp86, Lys88, Lys89, His125, Asp127, Lys129, Pro130, Gln131, Asn132, Leu134, Ala144, Asp145, Ala149, Val154, Thr158, Glu162, Val163, Val164, Thr165, Trp167, Tyr168, and Leu298. The two HPs showed 100% sequence identity with each other; therefore, HP A0A177DU49 alone was taken for structure prediction. Due to the unavailability of a reliable template, the structure of the HP was predicted through an ab initio algorithm using the Robetta server. Model validation with the Ramachandran plot showed 91.5% of amino acid residues in the favored region and 0.2% of amino acid residues in the disallowed region, showing high fold similarity with the template. The secondary structure prediction shows that the HPs consist of numerous alpha-helices connected through loops. The structure similarity search using the DALI server shows that the model is similar to pre-18S ribosomal RNA (Z score = 29.3, RMSD = 1.5 Å), putative toxin VAPC6 (Z score = 11.7, RMSD = 4.2 Å), and ribonuclease VAPC30 (Z score = 9.4, RMSD = 3.1 Å), etc. Moreover, structure-based function prediction using ProFunc showed that the protein may act as endonuclease NOB1. Both sequence and structure-based analyses indicate that these HPs function as Nin One Binding proteins.

A0A4Q4N975
A0A4Q4N975 is predicted to be localized in the mitochondria by WoLF PSORT and in the extracellular region by CELLO. There is no transmembrane helix present in the sequence of the HP. The motif and domain analysis suggests that the HP is a glycosyl hydrolase. Members of this glycosyl hydrolase family of enzymes have been identified in bacteria, fungi, and plants, and play key roles in different aspects of life ranging from developmental processes to host-pathogen interactions [63]. A sequence similarity search also suggests that this HP belongs to the glycoside hydrolase family 17 proteins. The predicted partners for the HP are endo-beta-1,3-glucanase and class III chitinase (which belongs to glycosyl hydrolase family 18).
Due to the unavailability of any reliable template in the PDB, the Robetta server was used to predict the model. The predicted model shows 86.0% of amino acid residues in the allowed region and 1.0% of residues in the disallowed region of the Ramachandran plot. The server was not able to completely predict the secondary structural elements of the HP; hence, ProFunc was not able to predict the HP function. The structure similarity search using the DALI server shows that the model is similar to 6FCG (Z score = 43.2, RMSD = 1.1 Å), 4WTP (Z score = 28.2, RMSD = 2.1 Å), and 3UR8 (Z score = 24.1, RMSD = 2.5 Å). Based on the sequence and structural analysis, the HP may function as a glycosyl hydrolase.

Conclusion
In the last decade, an enormous effort has been made to characterize the hypothetical proteins present in genomes. Functional assignment helps to understand molecular biology at the system level and to identify potential drug targets that can act specifically on pathogens to combat their pathogenicity. In the present study, computational analysis was performed to assess the allergenicity of hypothetical proteins in A. alternata. Based on the analysis, 10 proteins were predicted as potential allergens. Furthermore, we have characterized the functions of these HPs with a high level of confidence using various bioinformatics approaches. The predicted functions of the HPs are chitin binding, ribosomal protein P1, thaumatin-like protein, glycosyl hydrolase, and Nob1 Zn-binding protein. The physicochemical properties of the proteins help in the characterization of protein function, whereas the subcellular localization of the proteins plays a pivotal role in differentiating vaccine and drug targets. This study provides a basic understanding of the potential allergens and could aid in the development of novel therapeutics against A. alternata and other associated fungal allergic infections.

Table 1. Physico-chemical properties predicted for the potential allergens of A. alternata.

Table 2. Sub-cellular localization annotation of hypothetical proteins of A. alternata.

Table 3. Prediction of potential allergens among hypothetical proteins in Alternaria alternata using AlgPred tool.

Table 4. Screening of potential allergens in SDAP, representing its percentage identity with an allergen over a window of 80 amino acids.

Table 5. Sequence and structure-based functional annotation of potential allergens from the HPs of A. alternata.